In many data-driven discriminative tasks, scientists without task-specific prior knowledge often start with popular machine-learning classifiers. If one works, bingo! A happy ending: the algorithm can be deployed into production and the research gets published. Otherwise, they move on to more sophisticated models. If all major models fail, we have to turn to other detection modalities, which usually means the modality that generated the current dataset does not capture the information needed for the task.
A fundamental question lies beneath this "trial and error" practice (trial and error is a basic problem-solving method, characterized by repeated, varied attempts that continue until success or until the practitioner stops trying):
Does the dataset exhibit statistical differences between the groups/classes? Or equivalently: are the samples drawn from different distributions (generating processes)?
This reveals an implicit and often neglected pre-assessment step in the entire pipeline: dataset classifiability analysis.
The Bayes error rate (BER) is the lowest possible test error rate in classification; it is produced by the Bayes classifier and is analogous to the irreducible error rate.
Because of noise (the data-generating process is inherently stochastic), the error incurred by the oracle prediction model built on the true distribution p(x, y) is the Bayes error.
To calculate the BER, we can use the Gaussian Bayes model, based on the naive assumptions:
The features (in this case, PCs returned from PCA in the latent space) conform to Multivariate Gaussian Distributions.
Each class corresponds to one Gaussian distribution.
The Bayes optimal decision boundary will correspond to the point where two densities are equal.
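As a sketch of this idea (assuming equal class priors and a single 1-D feature for illustration), the decision boundary can be found numerically where the two class-conditional densities intersect:

```python
from scipy.optimize import brentq
from scipy.stats import norm

# Two hypothetical 1-D class-conditional Gaussians with equal priors
mu0, s0 = 0.0, 1.0
mu1, s1 = 2.0, 1.0

# The Bayes-optimal boundary is where the two densities are equal
f = lambda x: norm.pdf(x, mu0, s0) - norm.pdf(x, mu1, s1)
boundary = brentq(f, mu0, mu1)  # the root lies between the two means
print(boundary)  # for equal variances this is the midpoint of the means
```

For equal variances the two-density intersection reduces to the midpoint of the means; with unequal variances there can be two crossing points.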
According to the central limit theorem (CLT), if a statistic is a sum or average of repeated measurements, it will be approximately normal under certain technical conditions, regardless of the distribution of the individual samples. Each wavenumber in Raman, or each time-of-flight bin in MS, arises from a physical process that accumulates measurements of photons/ions/particles. Features are uncorrelated: at the micro level of the physical process, photons/ions/particles of different frequency/energy/mass/electric charge do not interfere with each other. The co-occurrence/correlation between features (i.e., peak patterns) is a macro-level relation.
Optional: check feature normality, e.g., by Q-Q plot
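A minimal Q-Q plot check with scipy.stats.probplot (synthetic data standing in for one PC; drop the Agg backend line inside a notebook):

```python
import numpy as np
import scipy.stats as stats
import matplotlib
matplotlib.use('Agg')  # headless backend for scripts; not needed in a notebook
import matplotlib.pyplot as plt

rng = np.random.default_rng(0)
pc1 = rng.normal(size=200)  # stand-in for one principal component

# Q-Q plot: sample quantiles against theoretical normal quantiles;
# points close to the reference line indicate approximate normality
fig, ax = plt.subplots()
res = stats.probplot(pc1, dist='norm', plot=ax)
print('correlation with normal quantiles:', round(res[1][2], 4))
```

`probplot` also returns the least-squares fit `(slope, intercept, r)`; an `r` near 1 supports the normality assumption.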
To calculate BER, we can either use:
Numerical integration: for each point, we integrate over the density that is not the highest one. As there are two densities, we sum the two integrals. A numerical integration package such as scipy.integrate can be used.
A Gaussian NB classifier, e.g., sklearn.naive_bayes.GaussianNB
predict_proba() returns the probability of the samples for each class in the model
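A sketch contrasting the two approaches on a synthetic 1-D problem (equal priors assumed; for N(0,1) vs N(2,1) the true BER is Φ(−1) ≈ 0.1587):

```python
import numpy as np
from scipy import integrate
from scipy.stats import norm
from sklearn.naive_bayes import GaussianNB

# Illustrative two-class problem: N(0, 1) vs N(2, 1), equal priors;
# with equal variances the Bayes boundary is the midpoint x = 1
mu0, mu1, s, boundary = 0.0, 2.0, 1.0, 1.0

# (1) Numerical integration: sum, for each class, the probability mass
# falling on the wrong side of the decision boundary
err0, _ = integrate.quad(lambda x: 0.5 * norm.pdf(x, mu0, s), boundary, np.inf)
err1, _ = integrate.quad(lambda x: 0.5 * norm.pdf(x, mu1, s), -np.inf, boundary)
ber_int = err0 + err1

# (2) GaussianNB: average (1 - max posterior) over a large sample,
# where predict_proba() returns the per-class posterior probabilities
rng = np.random.default_rng(0)
n = 50_000
X = np.r_[rng.normal(mu0, s, n), rng.normal(mu1, s, n)].reshape(-1, 1)
y = np.r_[np.zeros(n), np.ones(n)]
proba = GaussianNB().fit(X, y).predict_proba(X)
ber_nb = np.mean(1 - proba.max(axis=1))
print(round(ber_int, 4), round(ber_nb, 4))  # both close to 0.1587
```

The second approach generalizes directly to higher dimensions and more classes, at the cost of Monte-Carlo sampling error.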
Choose the SVM (support vector machine) as the base model and use K-fold CV (cross-validation) to pick the best model.
CV ensures the classifier has proper generalization capability (neither underfitting nor overfitting).
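For example (synthetic data standing in for the spectra; the parameter grid is illustrative):

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Synthetic stand-in for a two-class spectral dataset
X, y = make_classification(n_samples=200, n_features=20, random_state=0)

# 5-fold CV over C and kernel guards against under-/over-fitting
grid = GridSearchCV(
    make_pipeline(StandardScaler(), SVC()),
    param_grid={'svc__C': [0.1, 1, 10], 'svc__kernel': ['linear', 'rbf']},
    cv=5)
grid.fit(X, y)
print(grid.best_params_, round(grid.best_score_, 3))
```

`best_score_` is the mean cross-validated accuracy of the best parameter combination, a fairer estimate than training accuracy.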
Information gain is used in decision trees. For a specific feature, information gain (IG) measures how much "information" the feature gives us about the class.
$IG(Y|X) = H(Y) - H(Y|X) $
In information theory, IG answers: "if we transmit Y, how many bits can be saved if both sender and receiver know X?" Or: "how much information about Y is carried by X?"
Attribute/feature X with a high IG is a good split on Y.
It can be proven that information gain equals mutual information.
We can use sklearn.feature_selection.mutual_info_classif to calculate information gain.
Be cautious about the discrete_features parameter : {‘auto’, bool, array_like}, default ‘auto’
If bool, then determines whether to consider all features discrete or continuous. If array, then it should be either a boolean mask with shape (n_features,) or array with indices of discrete features. If ‘auto’, it is assigned to False for dense X and to True for sparse X.
For continuous features, use discrete_features = False.
Notes
The term “discrete features” is used instead of naming them “categorical”, because it describes the essence more accurately. For example, pixel intensities of an image are discrete features (but hardly categorical) and you will get better results if you mark them as such. Also note that treating a continuous variable as discrete and vice versa will usually give incorrect results, so be attentive about that.
True mutual information can’t be negative. If its estimate turns out to be negative, it is replaced by zero.
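A small usage sketch on synthetic continuous features (the informative feature is constructed here for illustration):

```python
import numpy as np
from sklearn.feature_selection import mutual_info_classif

rng = np.random.default_rng(0)
X = rng.normal(size=(300, 5))  # five continuous features
# Only feature 0 carries class information (plus a little noise)
y = (X[:, 0] + 0.1 * rng.normal(size=300) > 0).astype(int)

# discrete_features=False because the features are continuous
mi = mutual_info_classif(X, y, discrete_features=False, random_state=0)
print(mi.round(3))  # feature 0 should dominate
```

As the notes above warn, letting `discrete_features='auto'` pick the wrong mode for your data will distort the estimates.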
Raman spectra of Gujing Tribute Liquor of different ages
Price is roughly proportional to age. Taking 500 ml bottles as an example: 5-year 200 RMB, 8-year 300 RMB, 16-year 600 RMB, 26-year 1700 RMB.
With a long history, Gujing Tribute Liquor, known for its fragrant taste, is one of the eight most famous liquors in China. In 196 AD, Cao Cao presented the "Jiuyun Spring Liquor" produced in his hometown, together with its brewing methods, to Emperor Xian of the Han Dynasty as royal liquor. During the Wanli reign of the Ming Dynasty, it was presented to the royal court as a "tribute", a practice that continued through the Qing Dynasty; hence the name "Gujing Tribute Liquor". On the basis of traditional processes, it incorporates scientific recipes and technological innovations. It is "crystal clear, sweet and mellow like orchid, velvety and lasting after tasting", with a unique taste known for its sweetness, aroma, and full flavor. It was awarded the gold medal of the national liquor-tasting conference four times and won the title of "National Famous Liquor". In March 2003, it was incorporated into the system for protecting original products. In 2005, it became a national geographical indication product and has gained wide acclaim both at home and abroad. (source: http://english.bozhou.gov.cn/content/33.html)
from qsi import io
X,y,X_names,_,y_names = io.load_dataset("vintage_526", shift = 3000)
load dataset from 7344_Y5Y26.csv X.shape (121, 2089) y.shape (121,)
7344_Y5Y26.csv - Raman spectroscopic profiling dataset of 5-year and 26-year Gujing Tribute vintage liquors. y = 0: 5-year y = 1: 26-year Each sample has 2088 Raman wavenumbers, ranging from 251 to 2338 cm-1. Three outlier samples were removed. -------------------- If you use this data set, please add the reference: [1] A unified classifiability analysis framework based on meta-learner and its application in spectroscopic profiling data [J]. Applied Intelligence, 2021, doi: 10.1007/s10489-021-02810-8
# io.scatter_plot(X, y, tags = list(range(len(X))))
Additional preprocessing steps
Highly recommended for high-dimensional physico-chemical spectroscopic data, e.g., Raman and MALDI-TOF.
Prior knowledge: Raman spectrum data contain only additive or linear structure (each chemical bond or particle corresponds to several wavenumbers), with no complex embedded structure. Therefore non-linear dimensionality-reduction methods, such as kernel PCA, LLE, and t-SNE, are not suitable.
Coefficients from LASSO or ElasticNet depend on the magnitude of each variable. It is therefore necessary to rescale, or standardize, the variables.
Centering the variables means that there is no longer an intercept term.
Without feature scaling, the feature selection result can be quite different!
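A sketch of the effect on synthetic data (the alpha value is arbitrary): the same L1 penalty shrinks the two equally informative features very differently unless they are standardized first.

```python
import numpy as np
from sklearn.linear_model import Lasso
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
n = 200
x1 = rng.normal(size=n)            # unit-scale feature
x2 = 1000 * rng.normal(size=n)     # same information, 1000x larger scale
X = np.column_stack([x1, x2])
y = x1 + x2 / 1000 + 0.1 * rng.normal(size=n)  # both equally informative

# The L1 penalty acts on raw coefficient magnitudes: the unit-scale
# feature's coefficient (~1) is shrunk hard, while the large-scale
# feature's tiny coefficient (~0.001) is barely penalized at all
coef_raw = Lasso(alpha=0.5, max_iter=50_000).fit(X, y).coef_
coef_std = Lasso(alpha=0.5).fit(StandardScaler().fit_transform(X), y).coef_
print(coef_raw)  # relative shrinkage differs drastically between features
print(coef_std)  # roughly equal coefficients after standardization
```

On standardized features, both coefficients receive the same relative shrinkage, so a selection threshold treats them symmetrically.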
import cla.metrics
X,y = cla.metrics.mvg(
nobs = 100, # number of observations / samples
md = 0 # distance between means with respect to std, i.e. (mu2 - mu1) / std: how many stds apart the means are.
)
Unable to determine R home: [WinError 2] The system cannot find the file specified
rpy2 3.X may not support Windows. ECoL metrics may not be available.
import matplotlib
matplotlib.rcParams.update({'font.size': 10})
from sklearn.decomposition import PCA
X_pca = PCA(n_components = 2).fit_transform(X)
cla.metrics.plotComponents2D(X_pca, y)
<AxesSubplot:>
X,y = cla.metrics.mvg(
nobs = 100, # number of observations / samples
md = 0.5 # distance between means with respect to std, i.e. (mu2 - mu1) / std: how many stds apart the means are.
)
from sklearn.decomposition import PCA
X_pca = PCA(n_components = 2).fit_transform(X)
cla.metrics.plotComponents2D(X_pca, y)
<AxesSubplot:>
The above generates two datasets with between-class distances (md) of 0 and 0.5 std, respectively. The PCA-reduced visualizations match expectations.
Use sklearn.naive_bayes.GaussianNB
predict_proba() returns the probability of the samples for each class in the model
import matplotlib.pyplot as plt
bs = []
for NS in range(1, 40):
b, _ = cla.metrics.BER(X,y, NSigma = NS, save_fig = '')
bs.append(b)
plt.plot(bs)
[<matplotlib.lines.Line2D at 0x2cfe1925e20>]
BER decreases as the sampling range increases, stabilizing at around 10 sigma.
bs = []
for m in [10,100,1000,10000,100000]:
b,_ = cla.metrics.BER(X,y, nobs = int(m), save_fig = '')
bs.append(b)
plt.plot(bs)
[<matplotlib.lines.Line2D at 0x2cfe6f23340>]
BER is not much affected by the number of random sampling points, staying within the 0.1–0.11 range.
Use CV to get the best classifier. It returns the accuracy and the decision-boundary vertices.
# NOTE: the CLF in pyCLAMs
dct, *_ = cla.metrics.CLF(X,y, show = True)
dct['classification.ACC']
0.59
We prefer Info Gain over Correlation, because:
Correlation only measures the linear relationship (Pearson's correlation) or monotonic relationship (Spearman's correlation) between two variables.
Mutual information is more general and measures the reduction of uncertainty in Y after observing X. It is the Kullback-Leibler (KL) divergence between the joint density and the product of the marginal densities, so MI can capture non-monotonic and other more complicated relationships.
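A quick illustration on synthetic, non-monotonic data: Pearson's r misses the quadratic dependence, while MI detects it.

```python
import numpy as np
from scipy.stats import pearsonr
from sklearn.feature_selection import mutual_info_regression

rng = np.random.default_rng(0)
x = rng.uniform(-1, 1, 1000)
y = x ** 2 + 0.05 * rng.normal(size=1000)  # non-monotonic dependence

r, _ = pearsonr(x, y)                      # near zero: no linear relation
mi = mutual_info_regression(x.reshape(-1, 1), y, random_state=0)[0]
print(round(r, 3), round(mi, 3))           # |r| small, MI clearly positive
```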
mi,_ = cla.metrics.IG(X, y, show = True)
Let's define a unified metric:
$ D = w_0 + w_1 \times BER + w_2 \times ACC + w_3 \times IG + \dots $
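A minimal sketch of such a weighted combination (the weights below are hypothetical placeholders; in cla they are learned by the meta-learner, e.g., as logistic-regression coefficients):

```python
# Unified classifiability metric as a weighted combination of atom
# metrics. The default weights are hypothetical, for illustration only.
def unified_metric(ber, acc, ig, w=(0.0, -1.0, 1.0, 1.0)):
    w0, w1, w2, w3 = w
    return w0 + w1 * ber + w2 * acc + w3 * ig

# A low BER, high accuracy, and high IG all push D upward
print(round(unified_metric(ber=0.1, acc=0.9, ig=0.4), 3))
```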
We use mvg() to generate datasets of different degrees of classifiability.
from qsi import io
X,y,X_names,_,y_names = io.load_dataset("vintage_526", display = False) # salt
X, X_names = io.pre.x_binning(X, X_names, target_dim=0.1, flavor='max') # flavor = 'sum'
print('X.shape after preprocessing = ', X.shape)
load dataset from 7344_Y5Y26.csv X.shape (121, 2089) y.shape (121,) 7344_Y5Y26.csv - Raman spectroscopic profiling dataset of 5-year and 26-year Gujing Tribute vintage liquors. y = 0: 5-year y = 1: 26-year Each sample has 2088 Raman wavenumbers, ranging from 251 to 2338 cm-1. Two outlier samples were removed. -------------------- If you use this data set, please add the reference: [1] A unified classifiability analysis framework based on meta-learner and its application in spectroscopic profiling data [J]. Applied Intelligence, 2021, doi: 10.1007/s10489-021-02810-8 X.shape after preprocessing = (121, 208)
from cla.unify import calculate_atom_metrics
import numpy as np
dic = calculate_atom_metrics(mu = X.mean(axis = 0), s = X.std(axis = 0),
mds = np.linspace(0, 3, 4+3*4),
# repeat = 5, nobs = 100,
show_curve = True, show_html = True)
0%| | 0/48 [00:00<?, ?it/s]R[write to console]: Loading required package: ECoL 100%|██████████████████████████████████████████████████████████████████████████████████| 48/48 [30:56<00:00, 31.62s/it]
visualize_dict()
generate_html_for_dict()
| d | 0.0 | 0.2 | 0.4 | 0.6 | 0.8 | 1.0 | 1.2 | 1.4 | 1.6 | 1.8 | 2.0 | 2.2 | 2.4 | 2.6 | 2.8 | 3.0 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| classification.ACC | 0.932 | 0.975 | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 |
| classification.Kappa | 0.863 | 0.95 | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 |
| classification.F1_Score | 0.93 | 0.975 | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 |
| classification.Jaccard | 0.883 | 0.952 | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 |
| classification.Precision | 0.94 | 0.968 | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 |
| classification.Recall | 0.92 | 0.983 | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 |
| classification.McNemar | 0.476 | 0.396 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| classification.McNemar.CHI2 | inf | inf | inf | inf | inf | inf | inf | inf | inf | inf | inf | inf | inf | inf | inf | inf |
| classification.CochranQ | nan | nan | nan | nan | nan | nan | nan | nan | nan | nan | nan | nan | nan | nan | nan | nan |
| classification.CochranQ.T | nan | nan | nan | nan | nan | nan | nan | nan | nan | nan | nan | nan | nan | nan | nan | nan |
| classification.CrossEntropy | 0.235 | 0.169 | 0.021 | 0.009 | 0.014 | 0.041 | 0.058 | 0.065 | 0.051 | 0.042 | 0.036 | 0.029 | 0.026 | 0.023 | 0.02 | 0.018 |
| classification.Mean_KLD | 0.235 | 0.169 | 0.021 | 0.009 | 0.014 | 0.041 | 0.058 | 0.065 | 0.051 | 0.042 | 0.036 | 0.029 | 0.026 | 0.023 | 0.02 | 0.018 |
| classification.AP | 0.942 | 0.995 | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 |
| classification.Brier | 0.071 | 0.042 | 0.001 | 0.0 | 0.001 | 0.005 | 0.006 | 0.005 | 0.003 | 0.002 | 0.002 | 0.001 | 0.001 | 0.001 | 0.001 | 0.0 |
| classification.ROC_AUC | 0.948 | 0.995 | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 |
| classification.PR_AUC | 0.94 | 0.995 | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 | 1.0 |
| classification.BER | 0.007 | 0.006 | 0.007 | 0.006 | 0.005 | 0.004 | 0.004 | 0.003 | 0.003 | 0.003 | 0.003 | 0.003 | 0.002 | 0.002 | 0.002 | 0.002 |
| classification.SVM.Margin | 5.359 | 12.935 | 27.626 | 47.547 | 74.555 | 100.34 | 129.178 | 174.038 | 225.507 | 244.66 | 307.292 | 325.335 | 379.776 | 420.714 | 484.931 | 530.5 |
| correlation.IG.max | 0.097 | 0.108 | 0.131 | 0.167 | 0.198 | 0.234 | 0.271 | 0.326 | 0.389 | 0.433 | 0.47 | 0.508 | 0.562 | 0.58 | 0.614 | 0.642 |
| correlation.r.max | 0.192 | 0.282 | 0.364 | 0.455 | 0.515 | 0.584 | 0.625 | 0.68 | 0.73 | 0.75 | 0.787 | 0.808 | 0.837 | 0.851 | 0.868 | 0.881 |
| correlation.r.p.min | 0.007 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| correlation.rho.max | 0.193 | 0.278 | 0.362 | 0.463 | 0.525 | 0.599 | 0.642 | 0.702 | 0.757 | 0.779 | 0.802 | 0.827 | 0.844 | 0.85 | 0.859 | 0.862 |
| correlation.rho.p.min | 0.008 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| correlation.tau.max | 0.158 | 0.228 | 0.296 | 0.379 | 0.43 | 0.491 | 0.526 | 0.575 | 0.62 | 0.637 | 0.656 | 0.677 | 0.691 | 0.696 | 0.703 | 0.705 |
| correlation.tau.p.min | 0.008 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| test.ES.max | 0.39 | 0.586 | 0.779 | 1.016 | 1.195 | 1.431 | 1.592 | 1.847 | 2.129 | 2.254 | 2.543 | 2.733 | 3.045 | 3.234 | 3.474 | 3.717 |
| test.student.min | 0.007 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| test.student.min.log10 | -2.199 | -4.308 | -6.97 | -10.888 | -14.154 | -18.895 | -22.309 | -27.699 | -33.915 | -36.577 | -42.675 | -46.381 | -52.8 | -56.457 | -61.002 | -65.474 |
| test.student.T.max | 2.647 | 1.274 | 0.257 | 0.074 | -0.008 | -1.107 | 0.801 | -0.713 | 0.127 | -0.798 | 0.049 | 0.569 | 0.541 | 0.122 | -0.015 | -0.291 |
| test.ANOVA.min | 0.007 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| test.ANOVA.min.log10 | -2.2 | -4.309 | -6.971 | -10.888 | -14.205 | -18.942 | -22.309 | -27.805 | -33.915 | -36.625 | -42.773 | -46.711 | -53.02 | -56.724 | -61.294 | -65.741 |
| test.ANOVA.F.max | 7.627 | 17.261 | 30.495 | 51.75 | 71.406 | 102.472 | 126.72 | 170.709 | 227.302 | 254.138 | 323.477 | 373.73 | 464.177 | 523.2 | 603.549 | 691.646 |
| test.MANOVA | 0.329 | 0.244 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| test.MANOVA.log10 | -0.808 | -0.764 | -3.932 | -12.044 | -15.654 | -15.654 | -15.654 | -15.654 | -15.654 | -15.654 | -15.654 | -15.654 | -15.654 | -15.654 | -15.654 | -15.654 |
| test.MANOVA.F | 1.892 | 1.781 | 9.492 | 17.409 | 32.705 | 48.02 | 72.471 | 89.9 | 143.84 | 141.064 | 198.101 | 267.998 | 283.932 | 378.748 | 425.698 | 412.287 |
| test.MWW.min | 0.008 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| test.MWW.min.log10 | -2.197 | -4.084 | -6.504 | -10.179 | -12.879 | -16.556 | -18.87 | -22.411 | -25.922 | -27.347 | -28.94 | -30.704 | -31.987 | -32.372 | -33.033 | -33.266 |
| test.MWW.U.min | 4014.0 | 3392.667 | 2910.0 | 2328.0 | 1969.667 | 1539.667 | 1293.333 | 945.333 | 628.0 | 503.667 | 370.0 | 226.333 | 124.667 | 94.333 | 43.0 | 25.0 |
| test.KS.min | 0.008 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| test.KS.min.log10 | -2.149 | -4.121 | -5.76 | -9.501 | -11.811 | -15.967 | -17.366 | -23.805 | -28.144 | -31.137 | -35.083 | -38.335 | -41.508 | -45.376 | -49.252 | -52.58 |
| test.KS.D.max | 0.237 | 0.317 | 0.37 | 0.467 | 0.517 | 0.593 | 0.617 | 0.71 | 0.763 | 0.797 | 0.837 | 0.867 | 0.893 | 0.923 | 0.95 | 0.97 |
| test.CHISQ.min | 0.474 | 0.268 | 0.15 | 0.078 | 0.029 | 0.016 | 0.008 | 0.003 | 0.001 | 0.001 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| test.CHISQ.min.log10 | -0.324 | -0.575 | -0.835 | -1.111 | -1.535 | -1.805 | -2.086 | -2.586 | -2.862 | -3.219 | -3.573 | -3.931 | -4.278 | -4.67 | -5.15 | -5.219 |
| test.CHISQ.CHI2.max | 0.513 | 1.238 | 2.115 | 3.121 | 4.756 | 5.84 | 6.988 | 9.076 | 10.243 | 11.768 | 13.288 | 14.837 | 16.347 | 18.062 | 20.174 | 20.475 |
| test.KW.min | 0.008 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| test.KW.min.log10 | -2.198 | -4.086 | -6.507 | -10.182 | -12.883 | -16.561 | -18.875 | -22.416 | -25.928 | -27.353 | -28.946 | -30.711 | -31.993 | -32.379 | -33.039 | -33.273 |
| test.KW.H.max | 7.462 | 15.522 | 26.187 | 42.643 | 54.839 | 71.518 | 82.042 | 98.173 | 114.196 | 120.702 | 127.983 | 136.049 | 141.914 | 143.677 | 146.698 | 147.765 |
| test.Median.min | 0.0 | 0.0 | 0.007 | 0.12 | 1.353 | 1.3 | 1.993 | 0.067 | 0.167 | 0.873 | 0.867 | 0.873 | 0.66 | 0.127 | 1.033 | 0.34 |
| test.Median.min.log10 | -inf | -inf | -inf | -inf | -0.369 | 0.025 | 0.056 | -inf | -inf | -0.687 | -inf | -0.687 | -0.427 | -1.063 | -0.033 | -1.136 |
| test.Median.CH2.max | 1.0 | 1.0 | 0.963 | 0.781 | 0.403 | 0.307 | 0.286 | 0.853 | 0.827 | 0.56 | 0.508 | 0.56 | 0.515 | 0.743 | 0.335 | 0.699 |
| overlapping.F1.mean | 0.058 | 0.105 | 0.192 | 0.286 | 0.368 | 0.445 | 0.515 | 0.584 | 0.638 | 0.689 | 0.724 | 0.76 | 0.785 | 0.805 | 0.821 | 0.832 |
| overlapping.F1.sd | 0.042 | 0.061 | 0.068 | 0.07 | 0.068 | 0.063 | 0.059 | 0.06 | 0.061 | 0.058 | 0.058 | 0.056 | 0.056 | 0.057 | 0.055 | 0.054 |
| overlapping.F1v.mean | 0.713 | 0.7 | 0.693 | 0.692 | 0.696 | 0.709 | 0.731 | 0.746 | 0.75 | 0.75 | 0.75 | 0.75 | 0.75 | 0.75 | 0.75 | 0.75 |
| overlapping.F1v.sd | 0.013 | 0.009 | 0.005 | 0.006 | 0.012 | 0.022 | 0.022 | 0.012 | 0.004 | 0.003 | 0.003 | 0.002 | 0.002 | 0.003 | 0.003 | 0.003 |
| overlapping.F2.mean | 1.0 | 0.995 | 0.955 | 0.342 | 0.067 | 0.043 | 0.168 | 0.128 | 0.172 | 0.208 | 0.128 | 0.163 | 0.152 | 0.175 | 0.32 | 0.282 |
| overlapping.F2.sd | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| overlapping.F3.mean | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| overlapping.F3.sd | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| overlapping.F4.mean | nan | nan | nan | nan | nan | nan | nan | nan | nan | nan | nan | nan | nan | nan | nan | nan |
| overlapping.F4.sd | 0.001 | 0.001 | 0.001 | 0.001 | 0.002 | 0.001 | 0.001 | 0.0 | 0.001 | 0.0 | 0.001 | 0.001 | 0.001 | 0.001 | 0.001 | 0.001 |
| neighborhood.N1 | 0.009 | 0.02 | 0.01 | 0.013 | 0.025 | 0.016 | 0.02 | 0.001 | 0.011 | 0.006 | 0.01 | 0.007 | 0.012 | 0.012 | 0.015 | 0.014 |
| neighborhood.N2.mean | 0.504 | 0.389 | 0.141 | 0.008 | 0.005 | 0.005 | 0.005 | 0.005 | 0.005 | 0.005 | 0.005 | 0.005 | 0.005 | 0.005 | 0.005 | 0.005 |
| neighborhood.N2.sd | 0.501 | 0.487 | 0.348 | 0.09 | 0.071 | 0.071 | 0.071 | 0.071 | 0.071 | 0.071 | 0.071 | 0.071 | 0.071 | 0.071 | 0.071 | 0.071 |
| neighborhood.N3.mean | 0.207 | 0.207 | 0.203 | 0.199 | 0.194 | 0.19 | 0.183 | 0.178 | 0.172 | 0.168 | 0.162 | 0.158 | 0.154 | 0.152 | 0.146 | 0.143 |
| neighborhood.N3.sd | 0.011 | 0.011 | 0.011 | 0.01 | 0.01 | 0.01 | 0.01 | 0.011 | 0.012 | 0.012 | 0.013 | 0.014 | 0.015 | 0.016 | 0.017 | 0.019 |
| neighborhood.N4.mean | 0.508 | 0.387 | 0.132 | 0.003 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| neighborhood.N4.sd | 0.501 | 0.488 | 0.339 | 0.047 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
| neighborhood.T1.mean | 0.002 | 0.002 | 0.002 | 0.003 | 0.001 | 0.001 | 0.002 | 0.001 | 0.001 | 0.002 | 0.003 | 0.002 | 0.003 | 0.001 | 0.003 | 0.002 |
| neighborhood.T1.sd | 0.028 | 0.033 | 0.022 | 0.035 | 0.015 | 0.008 | 0.033 | 0.012 | 0.007 | 0.032 | 0.047 | 0.027 | 0.04 | 0.009 | 0.049 | 0.025 |
| neighborhood.LSC | 1.04 | 1.04 | 1.04 | 1.04 | 1.04 | 1.04 | 1.04 | 1.04 | 1.04 | 1.04 | 1.04 | 1.04 | 1.04 | 1.04 | 1.04 | 1.04 |
| linearity.L1.mean | 0.157 | 0.153 | 0.157 | 0.145 | 0.142 | 0.128 | 0.123 | 0.11 | 0.1 | 0.087 | 0.085 | 0.078 | 0.07 | 0.067 | 0.06 | 0.057 |
| linearity.L1.sd | 0.151 | 0.147 | 0.151 | 0.139 | 0.136 | 0.123 | 0.119 | 0.106 | 0.096 | 0.083 | 0.082 | 0.075 | 0.067 | 0.064 | 0.058 | 0.054 |
import joblib
joblib.dump(dic, 'vintage_526_atom_metrics.pkl')
['vintage_526_atom_metrics.pkl']
import joblib
dic = joblib.load('vintage_526_atom_metrics.pkl')
Record the computed atom metrics.
from cla.unify import train_metalearner, calculate_unified_metric
model = train_metalearner(M, dic['d'], cutoff = 2)
Score: 1.0 Coef and Intercept: [[-3.47298416e-08 3.12323602e-02 2.42807027e-05 2.73138141e-05 2.16828499e-05 1.77481776e-05 1.63564450e-04 -3.29321761e-03 -3.32046375e-03 3.57354730e-02 -2.05716484e-04 2.79688743e-02 -1.13816207e-03 -5.65481074e-02 -2.29551345e-03 2.96614339e-05 -2.16302991e-04 9.02065961e-04 -1.13832441e-03 5.14304884e-03 2.69952584e-05 1.03266718e-05 -2.64035099e-07 6.79523532e-07 -2.06381078e-07 -1.98443344e-07]] [1.3719713e-05]
from cla.unify import calculate_unified_metric
calculate_unified_metric(X, y, model, keys, method = 'meta.logistic')
KW Exception: All numbers are identical in kruskal MedianTest Exception: All values are below the grand median (10.0).
Exception ignored from cffi callback <function _consolewrite_ex at 0x000001725F2DE940>:
Traceback (most recent call last):
File "C:\Users\eleve\anaconda3\lib\site-packages\rpy2\rinterface_lib\callbacks.py", line 133, in _consolewrite_ex
s = conversion._cchar_to_str_with_maxlen(buf, n, _CCHAR_ENCODING)
File "C:\Users\eleve\anaconda3\lib\site-packages\rpy2\rinterface_lib\conversion.py", line 138, in _cchar_to_str_with_maxlen
s = ffi.string(c, maxlen).decode(encoding)
UnicodeDecodeError: 'utf-8' codec can't decode byte 0xd4 in position 0: invalid continuation byte
KW Exception: All numbers are identical in kruskal MedianTest Exception: All values are below the grand median (10.0). c = 0 , in-class unified metric = 1.0493593448343555e-05 c = 1 , in-class unified metric = 2.127618655446456e-06
(1.0, [1.0493593448343555e-05, 2.127618655446456e-06])
This is an all-in-one example. It reuses the pickled atom metrics and applies the metric filter.
from qsi import io
X,y,X_names,_,y_names = io.load_dataset("vintage_526", display = False) # salt
X, X_names = io.pre.x_binning(X, X_names, target_dim=0.1, flavor='max') # flavor = 'sum'
print('X.shape after preprocessing = ', X.shape)
load dataset from 7344_Y5Y26.csv X.shape (121, 2089) y.shape (121,) 7344_Y5Y26.csv - Raman spectroscopic profiling dataset of 5-year and 26-year Gujing Tribute vintage liquors. y = 0: 5-year y = 1: 26-year Each sample has 2088 Raman wavenumbers, ranging from 251 to 2338 cm-1. Two outlier samples were removed. -------------------- If you use this data set, please add the reference: [1] A unified classifiability analysis framework based on meta-learner and its application in spectroscopic profiling data [J]. Applied Intelligence, 2021, doi: 10.1007/s10489-021-02810-8 X.shape after preprocessing = (121, 208)
from cla.unify import analyze
analyze(X,y, use_filter = True, method = 'meta.linear', pkl = '20221126231414.365863.pkl')
Load atom metrics from 20221126231414.365863.pkl before filter
Metrics above the threshold (0.5): ['classification.CrossEntropy' 'classification.Mean_KLD' 'classification.BER' 'classification.SVM.Margin' 'correlation.IG.max' 'correlation.r.max' 'correlation.rho.max' 'correlation.tau.max' 'test.ES.max' 'test.student.min.log10' 'test.ANOVA.min.log10' 'test.ANOVA.F.max' 'test.MANOVA.F' 'test.MWW.min.log10' 'test.MWW.U.min' 'test.KS.min.log10' 'test.KS.D.max' 'test.CHISQ.min.log10' 'test.CHISQ.CHI2.max' 'test.KW.min.log10' 'test.KW.H.max' 'overlapping.F1.mean' 'overlapping.F1v.mean' 'neighborhood.N3.mean' 'neighborhood.N3.sd' 'linearity.L1.mean' 'linearity.L1.sd'] Coef and Intercept: [-8.99684453e+00 -8.99684452e+00 -5.18337153e+00 2.37537900e-03 -2.41506876e+00 2.72898899e+01 4.89177112e+00 4.00408723e+00 -3.28383202e+00 -2.28564022e-02 -3.70104037e-02 3.20868682e-03 -3.51051937e-04 1.36222052e+01 -5.82097617e-03 -6.70669494e-02 -8.80223782e+00 2.68495633e+00 6.82368943e-01 1.35767182e+01 5.84758747e+00 -1.84780560e+01 2.20749327e+01 -2.88257925e+01 -5.09165177e+00 6.41524845e+00 6.16850812e+00] 27.077690979573084 KW Exception: All numbers are identical in kruskal MedianTest Exception: All values are below the grand median (10.0).
Exception ignored from cffi callback <function _consolewrite_ex at 0x0000015E9F557B80>:
Traceback (most recent call last):
File "C:\Users\eleve\anaconda3\lib\site-packages\rpy2\rinterface_lib\callbacks.py", line 133, in _consolewrite_ex
s = conversion._cchar_to_str_with_maxlen(buf, n, _CCHAR_ENCODING)
File "C:\Users\eleve\anaconda3\lib\site-packages\rpy2\rinterface_lib\conversion.py", line 138, in _cchar_to_str_with_maxlen
s = ffi.string(c, maxlen).decode(encoding)
UnicodeDecodeError: 'utf-8' codec can't decode byte 0xd4 in position 0: invalid continuation byte
KW Exception: All numbers are identical in kruskal MedianTest Exception: All values are below the grand median (10.0). KW Exception: All numbers are identical in kruskal MedianTest Exception: All values are below the grand median (10.0). KW Exception: All numbers are identical in kruskal MedianTest Exception: All values are below the grand median (10.0). c = 0 , in-class unified metric = [-4.36287921] KW Exception: All numbers are identical in kruskal MedianTest Exception: All values are below the grand median (10.0). KW Exception: All numbers are identical in kruskal MedianTest Exception: All values are below the grand median (10.0). KW Exception: All numbers are identical in kruskal MedianTest Exception: All values are below the grand median (10.0). c = 1 , in-class unified metric = [-13.78564454]
(array([1.20565795e+307]), [array([-4.36287921]), array([-13.78564454])], '20221126231414.365863.pkl')
from cla.unify import analyze
analyze(X,y, use_filter = True, method = 'meta.logistic', pkl = '20221126231414.365863.pkl')
Unable to determine R home: [WinError 2] The system cannot find the file specified
rpy2 3.X may not support Windows. ECoL metrics may not be available. Load atom metrics from 20221126231414.365863.pkl before filter
Metrics above the threshold (0.5): ['classification.CrossEntropy' 'classification.Mean_KLD' 'classification.BER' 'classification.SVM.Margin' 'correlation.IG.max' 'correlation.r.max' 'correlation.rho.max' 'correlation.tau.max' 'test.ES.max' 'test.student.min.log10' 'test.ANOVA.min.log10' 'test.ANOVA.F.max' 'test.MANOVA.F' 'test.MWW.min.log10' 'test.MWW.U.min' 'test.KS.min.log10' 'test.KS.D.max' 'test.CHISQ.min.log10' 'test.CHISQ.CHI2.max' 'test.KW.min.log10' 'test.KW.H.max' 'overlapping.F1.mean' 'overlapping.F1v.mean' 'neighborhood.N3.mean' 'neighborhood.N3.sd' 'linearity.L1.mean' 'linearity.L1.sd'] Score: 1.0 Coef and Intercept: [[-5.73373270e-07 -5.73373270e-07 7.81742374e-09 1.93720782e-02 2.72407895e-05 3.57117946e-05 3.56104936e-05 2.91484452e-05 1.49638818e-04 -2.79887334e-03 -2.78816641e-03 2.58263316e-02 1.98688303e-02 -1.53056212e-03 -3.23749432e-02 -2.18890719e-03 3.90938098e-05 -2.23973768e-04 8.90254823e-04 -1.53082977e-03 6.84549976e-03 3.60751853e-05 2.60570471e-05 4.37587961e-06 6.39158385e-07 3.28793927e-07 3.16148007e-07]] [3.46445846e-05] KW Exception: All numbers are identical in kruskal MedianTest Exception: All values are below the grand median (10.0).
Exception ignored from cffi callback <function _consolewrite_ex at 0x00000246E4556940>:
Traceback (most recent call last):
File "C:\Users\eleve\anaconda3\lib\site-packages\rpy2\rinterface_lib\callbacks.py", line 133, in _consolewrite_ex
s = conversion._cchar_to_str_with_maxlen(buf, n, _CCHAR_ENCODING)
File "C:\Users\eleve\anaconda3\lib\site-packages\rpy2\rinterface_lib\conversion.py", line 138, in _cchar_to_str_with_maxlen
s = ffi.string(c, maxlen).decode(encoding)
UnicodeDecodeError: 'utf-8' codec can't decode byte 0xd4 in position 0: invalid continuation byte
KW Exception: All numbers are identical in kruskal MedianTest Exception: All values are below the grand median (10.0). KW Exception: All numbers are identical in kruskal MedianTest Exception: All values are below the grand median (10.0). KW Exception: All numbers are identical in kruskal MedianTest Exception: All values are below the grand median (10.0). c = 0 , in-class unified metric = 5.412587631243967e-05 KW Exception: All numbers are identical in kruskal MedianTest Exception: All values are below the grand median (10.0). KW Exception: All numbers are identical in kruskal MedianTest Exception: All values are below the grand median (10.0). KW Exception: All numbers are identical in kruskal MedianTest Exception: All values are below the grand median (10.0). c = 1 , in-class unified metric = 0.001343643563679917
(1.0, [5.412587631243967e-05, 0.001343643563679917], '20221126231414.365863.pkl')
analyze(X,y, use_filter = True, method = 'decompose.pca', pkl = '20221126231414.365863.pkl')
Load atom metrics from 20221126231414.365863.pkl before filter
Metrics above the threshold (0.5): ['classification.CrossEntropy' 'classification.Mean_KLD' 'classification.BER' 'classification.SVM.Margin' 'correlation.IG.max' 'correlation.r.max' 'correlation.rho.max' 'correlation.tau.max' 'test.ES.max' 'test.student.min.log10' 'test.ANOVA.min.log10' 'test.ANOVA.F.max' 'test.MANOVA.F' 'test.MWW.min.log10' 'test.MWW.U.min' 'test.KS.min.log10' 'test.KS.D.max' 'test.CHISQ.min.log10' 'test.CHISQ.CHI2.max' 'test.KW.min.log10' 'test.KW.H.max' 'overlapping.F1.mean' 'overlapping.F1v.mean' 'neighborhood.N3.mean' 'neighborhood.N3.sd' 'linearity.L1.mean' 'linearity.L1.sd']
Explained Variance Ratios for the first three PCs [8.18354850e-01 1.80890834e-01 5.36143576e-04]
KW Exception: All numbers are identical in kruskal
MedianTest Exception: All values are below the grand median (10.0).
c = 0 , in-class unified metric = 520.9017735032552
c = 1 , in-class unified metric = 512.5762879171997
before scaling: -2.507408392083686e+306 [520.9017735032552, 512.5762879171997]
PC1 range: -1911.3099069259176 3166.1374349557695
after scaling: 1.0 [0.52097747 0.52261717]
(1.0, array([0.52097747, 0.52261717]), '20221126231414.365863.pkl')
analyze(X,y, use_filter = True, method = 'decompose.lda', pkl = '20221126231414.365863.pkl')
Load atom metrics from 20221126231414.365863.pkl before filter
Metrics above the threshold (0.5): ['classification.CrossEntropy' 'classification.Mean_KLD' 'classification.BER' 'classification.SVM.Margin' 'correlation.IG.max' 'correlation.r.max' 'correlation.rho.max' 'correlation.tau.max' 'test.ES.max' 'test.student.min.log10' 'test.ANOVA.min.log10' 'test.ANOVA.F.max' 'test.MANOVA.F' 'test.MWW.min.log10' 'test.MWW.U.min' 'test.KS.min.log10' 'test.KS.D.max' 'test.CHISQ.min.log10' 'test.CHISQ.CHI2.max' 'test.KW.min.log10' 'test.KW.H.max' 'overlapping.F1.mean' 'overlapping.F1v.mean' 'neighborhood.N3.mean' 'neighborhood.N3.sd' 'linearity.L1.mean' 'linearity.L1.sd']
Explained Variance Ratios for the first three PCs [1.]
KW Exception: All numbers are identical in kruskal
MedianTest Exception: All values are below the grand median (10.0).
c = 0 , in-class unified metric = 953.7777989571432
c = 1 , in-class unified metric = 887.6152800386504
before scaling: -inf [953.7777989571432, 887.6152800386504]
PC1 range: -17.22896802736844 33.769284795474746
after scaling: 1.0 [0. 0.]
(1.0, array([0., 0.]), '20221126231414.365863.pkl')
Conclusion: the between-class classifiability metric is far larger than the in-class values, indicating that the two classes are statistically well separated.
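The comparison behind this conclusion can be illustrated with a toy example (a hypothetical sketch, not the cla implementation): a separability statistic computed across the two true classes should far exceed the same statistic computed on a random split within a single class, which serves as the in-class baseline.

```python
# Toy illustration of between-class vs. in-class separability.
# Assumption: the Kolmogorov-Smirnov D statistic stands in for the
# unified metric used by cla; any two-sample statistic would do.
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
class_a = rng.normal(0.0, 1.0, 200)   # class 0 samples
class_b = rng.normal(2.0, 1.0, 200)   # class 1 samples, shifted mean

# Between-class statistic: KS D over the two true classes.
d_between, _ = stats.ks_2samp(class_a, class_b)

# In-class baseline: randomly split one class in half and recompute.
half = rng.permutation(200)
d_within, _ = stats.ks_2samp(class_a[half[:100]], class_a[half[100:]])

print(d_between, d_within)  # between-class D is far larger
```

The same logic generalizes: whatever atom metric is used, the in-class value approximates its noise floor, so a between-class value well above it signals genuine distributional differences.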
from qsi import io
X,y,X_names,_,y_names = io.load_dataset("salt", x_range=list(range(400,1400)))
X, X_names = io.pre.x_binning(X, X_names, target_dim=0.1, flavor='max') # flavor = 'sum'
print('Data shape after binning: X.shape = ', X.shape)
load dataset from 7545.csv X.shape (125, 1000) y.shape (125,)
Raman spectra of table salt
["jinzhihaiyan(zhongyan): refined sea salt (China National Salt Group), iodized",
"shenjinkuangyan(zhongyan): deep-well rock salt (China National Salt Group), iodized",
"aozhoutianranhaiyan(huaiyan): Australian natural sea salt (Huaiyan brand), non-iodized",
"aozhouxueyan(huaiyan): Australian snow salt (Huaiyan brand), non-iodized"]
We use the 2nd and 3rd classes: 1 - deep-well rock salt (China), 2 - natural sea salt (Australia)
Instrument: Laser Raman Spectrometer
Manufacturer: Enwave Optronics, U.S.A.
Model: Prott-ezRaman-d3
Test parameters: laser wavelength 785 nm;
laser power 450 mW (Max mode);
CCD at -85 °C;
integration time 30 s
Data shape after binning: X.shape = (125, 100)
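The binning step above reduces each spectrum from 1000 to 100 features; with flavor='max' each output feature is the maximum over a bin of consecutive wavenumbers. A minimal NumPy sketch of this idea (the actual qsi x_binning implementation may differ in edge handling and how target_dim is interpreted):

```python
# Minimal sketch of max-binning: reduce feature dimension by taking the
# maximum over consecutive, non-overlapping bins of columns.
import numpy as np

def x_binning_max(X, bin_size):
    """Collapse every `bin_size` adjacent features into their maximum."""
    n, d = X.shape
    d_new = d // bin_size
    # Drop any trailing features that do not fill a complete bin.
    return X[:, :d_new * bin_size].reshape(n, d_new, bin_size).max(axis=2)

X = np.arange(2 * 1000).reshape(2, 1000).astype(float)
X_binned = x_binning_max(X, bin_size=10)  # 1000 -> 100 features
print(X_binned.shape)  # (2, 100)
```

Max-binning preserves peak heights (useful for sparse Raman peaks), whereas flavor='sum' would preserve total intensity per bin.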
from cla.unify import analyze
analyze(X,y, use_filter = True, pkl = '20221127191822.217812.pkl', method = 'meta.linear')
Load atom metrics from 20221127191822.217812.pkl before filter
Metrics above the threshold (0.5): ['classification.BER' 'classification.SVM.Margin' 'correlation.IG.max' 'correlation.r.max' 'correlation.rho.max' 'correlation.tau.max' 'test.ES.max' 'test.student.min.log10' 'test.student.T.max' 'test.ANOVA.min.log10' 'test.ANOVA.F.max' 'test.MANOVA.F' 'test.MWW.min.log10' 'test.MWW.U.min' 'test.KS.min.log10' 'test.KS.D.max' 'test.CHISQ.min.log10' 'test.CHISQ.CHI2.max' 'test.KW.min.log10' 'test.KW.H.max' 'test.Median.min' 'test.Median.min.log10' 'overlapping.F1.mean' 'overlapping.F1.sd' 'overlapping.F1v.mean' 'neighborhood.N3.mean' 'neighborhood.N3.sd' 'linearity.L1.mean' 'linearity.L1.sd']
Coef and Intercept: [ 2.76916359e-01 1.60863973e-03 -5.91476791e-02 -3.66022732e+00 -1.24847056e+00 -1.02191720e+00 1.70679775e+00 -9.41727982e-02 -1.82252403e-01 1.11443118e-01 -3.03151515e-03 -2.07351170e-04 -3.43927288e+00 1.10905381e-03 -1.06687646e-02 8.06387989e-01 4.64748456e+00 1.04938891e+00 -3.42747178e+00 -1.47822201e+00 -8.88982547e-03 9.25551581e-01 2.31739150e-01 3.04947082e+00 -2.05232016e+00 1.84737218e+00 1.18738669e+00 2.37477337e+00] -10.168545362724334
c = 1 , in-class unified metric = [-4.83674827]
c = 2 , in-class unified metric = [-5.28345614]
(array([-3.62766367]), [array([-4.83674827]), array([-5.28345614])], '20221127191822.217812.pkl')
from cla.unify import analyze
analyze(X,y, use_filter = True, pkl = '20221127191822.217812.pkl', method = 'meta.logistic')
Load atom metrics from 20221127191822.217812.pkl before filter
Metrics above the threshold (0.5): ['classification.BER' 'classification.SVM.Margin' 'correlation.IG.max' 'correlation.r.max' 'correlation.rho.max' 'correlation.tau.max' 'test.ES.max' 'test.student.min.log10' 'test.student.T.max' 'test.ANOVA.min.log10' 'test.ANOVA.F.max' 'test.MANOVA.F' 'test.MWW.min.log10' 'test.MWW.U.min' 'test.KS.min.log10' 'test.KS.D.max' 'test.CHISQ.min.log10' 'test.CHISQ.CHI2.max' 'test.KW.min.log10' 'test.KW.H.max' 'test.Median.min' 'test.Median.min.log10' 'overlapping.F1.mean' 'overlapping.F1.sd' 'overlapping.F1v.mean' 'neighborhood.N3.mean' 'neighborhood.N3.sd' 'linearity.L1.mean' 'linearity.L1.sd']
Score: 1.0
Coef and Intercept: [[ 1.35800829e-08 1.79080650e-02 2.27082498e-05 2.68941122e-05 2.76123428e-05 2.26016767e-05 1.11499735e-04 -2.06499142e-03 -5.61227832e-04 -2.06887367e-03 1.91052481e-02 1.35331494e-02 -1.19610027e-03 -2.73930910e-02 -1.89001196e-03 3.23139517e-05 -1.80724310e-04 7.22407390e-04 -1.19630776e-03 5.35223776e-03 3.71357695e-03 2.72674190e-05 3.00809899e-08 1.98636069e-05 3.30494068e-06 5.05136354e-07 1.85420103e-06 3.70840205e-06]] [2.64054056e-05]
c = 1 , in-class unified metric = 0.0001808874752863014
c = 2 , in-class unified metric = 0.0007227458921617193
(4.341332031428809e-13, [0.0001808874752863014, 0.0007227458921617193], '20221127191822.217812.pkl')
from cla.unify import analyze
analyze(X,y, use_filter = True, pkl = '20221127191822.217812.pkl', method = 'decompose.pca')
Load atom metrics from 20221127191822.217812.pkl before filter
Metrics above the threshold (0.5): ['classification.BER' 'classification.SVM.Margin' 'correlation.IG.max' 'correlation.r.max' 'correlation.rho.max' 'correlation.tau.max' 'test.ES.max' 'test.student.min.log10' 'test.student.T.max' 'test.ANOVA.min.log10' 'test.ANOVA.F.max' 'test.MANOVA.F' 'test.MWW.min.log10' 'test.MWW.U.min' 'test.KS.min.log10' 'test.KS.D.max' 'test.CHISQ.min.log10' 'test.CHISQ.CHI2.max' 'test.KW.min.log10' 'test.KW.H.max' 'test.Median.min' 'test.Median.min.log10' 'overlapping.F1.mean' 'overlapping.F1.sd' 'overlapping.F1v.mean' 'neighborhood.N3.mean' 'neighborhood.N3.sd' 'linearity.L1.mean' 'linearity.L1.sd']
Explained Variance Ratios for the first three PCs [8.26779438e-01 1.72092940e-01 6.23388247e-04]
c = 1 , in-class unified metric = 518.018235293348
c = 2 , in-class unified metric = 529.3232315241899
before scaling: 1078.5798570230465 [518.018235293348, 529.3232315241899]
PC1 range: -1952.4088786029863 3117.744458116934
after scaling: 0.4021899271419477 [0.51275101 0.51052129]
(0.4021899271419477, array([0.51275101, 0.51052129]), '20221127191822.217812.pkl')
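The decompose.pca path projects the data with PCA and reports explained variance ratios and the PC1 range, as in the output above. A sketch of that projection step with scikit-learn (synthetic data and an assumed workflow, not the cla internals):

```python
# Sketch of the PCA decomposition step: fit PCA, inspect explained
# variance ratios, and examine the PC1 score range.
# The synthetic data deliberately has one dominant variance direction.
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
X = rng.normal(size=(125, 100))
X[:, 0] *= 20.0  # inflate variance along the first feature

pca = PCA(n_components=3)
scores = pca.fit_transform(X)
print(pca.explained_variance_ratio_)          # PC1 ratio dominates
print(scores[:, 0].min(), scores[:, 0].max())  # PC1 range
```

As in the salt dataset above, a dominant PC1 ratio means the in-class and between-class metrics can then be compared within a one- or two-dimensional latent space.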
from cla.unify import analyze
analyze(X,y, use_filter = True, pkl = '20221127191822.217812.pkl', method = 'decompose.lda')
Unable to determine R home: [WinError 2] The system cannot find the file specified
rpy2 3.X may not support Windows. ECoL metrics may not be available. Load atom metrics from 20221127191822.217812.pkl before filter
Metrics above the threshold (0.5): ['classification.BER' 'classification.SVM.Margin' 'correlation.IG.max' 'correlation.r.max' 'correlation.rho.max' 'correlation.tau.max' 'test.ES.max' 'test.student.min.log10' 'test.student.T.max' 'test.ANOVA.min.log10' 'test.ANOVA.F.max' 'test.MANOVA.F' 'test.MWW.min.log10' 'test.MWW.U.min' 'test.KS.min.log10' 'test.KS.D.max' 'test.CHISQ.min.log10' 'test.CHISQ.CHI2.max' 'test.KW.min.log10' 'test.KW.H.max' 'test.Median.min' 'test.Median.min.log10' 'overlapping.F1.mean' 'overlapping.F1.sd' 'overlapping.F1v.mean' 'neighborhood.N3.mean' 'neighborhood.N3.sd' 'linearity.L1.mean' 'linearity.L1.sd']
Explained Variance Ratios for the first component 1.0
c = 1 , in-class unified metric = -118.84413572036078
c = 2 , in-class unified metric = -129.31694369214372
before scaling: -56.81520419251131 [-118.84413572036078, -129.31694369214372]
C1 range: -11.518618553463384 11.1203427310966
after scaling: 1.0 [1. 1.]
(1.0, array([1., 1.]), '20221127191822.217812.pkl')
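For a two-class problem, LDA can produce at most one discriminant component, which is why the explained variance ratio reported above is [1.]. A minimal scikit-learn sketch of this decompose.lda step (synthetic data and an assumed workflow, not the cla code):

```python
# Sketch of the LDA decomposition step for a binary problem: the single
# discriminant component trivially carries all between-class variance.
import numpy as np
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 1, (60, 5)),   # class 1 samples
               rng.normal(1, 1, (60, 5))])  # class 2 samples, shifted
y = np.array([1] * 60 + [2] * 60)

lda = LinearDiscriminantAnalysis(n_components=1)
Z = lda.fit_transform(X, y)
print(lda.explained_variance_ratio_)  # [1.] for two classes
print(Z.shape)                        # (120, 1)
```

Unlike PCA, LDA is supervised, so its single component is chosen to maximize class separation; the C1 range in the output above is the span of the samples along this discriminant axis.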
Compared with the previous dataset, this dataset shows poorer classifiability (the between-class metric is closer to the in-class baseline).
import sklearn
import scipy
import statsmodels
import seaborn
import numpy
import pandas
import qsi
import clams
print(sklearn.__version__,
scipy.__version__,
statsmodels.__version__,
seaborn.__version__,
numpy.__version__,
pandas.__version__,
qsi.__version__,
clams.__version__)
1.0.2 1.7.3 0.13.2 0.11.2 1.21.5 1.4.2